Abstract

We discuss a general method to learn data representations from multiple tasks. We provide 
a justification for this method in both settings of multitask learning and learning-to-learn. 
The method is illustrated in detail in the special case of linear feature learning. Conditions 
on the theoretical advantage offered by multitask representation learning over independent 
task learning are established. In particular, focusing on the important example of half-space 
learning, we derive the regime in which multitask representation learning is beneficial over 
independent task learning, as a function of the sample size, the number of tasks and the 
intrinsic data dimensionality. Other potential applications of our results include multitask 
feature learning in reproducing kernel Hilbert spaces and multilayer, deep networks. 

Keywords: learning-to-learn, multitask learning, representation learning, statistical
learning theory, transfer learning


1. Introduction 


Multitask learning (MTL) can be characterized as the problem of learning multiple tasks
jointly, as opposed to learning each task in isolation. This problem is becoming increasingly
important due to its relevance in many applications, ranging from modelling users' preferences
for products, to multiple object classification in computer vision, to patient healthcare
data analysis in health informatics, to mention but a few. Multitask learning algorithms
which exploit structure and similarities across different learning problems have been studied
by the machine learning community since the mid 90's, initially in connection to neural
network models (see Baxter, 2000; Caruana, 1998; Thrun and Pratt, 1998, and references
therein). More recent approaches have been based on kernel methods (Evgeniou et al.,
2005), structured sparsity and convex optimization (Argyriou et al., 2008), among others.


Closely related to multitask learning but more challenging is the problem of learning-to- 
learn (LTL), namely learning to perform a new task by exploiting knowledge acquired when 
solving previous tasks. Arguably, a solution to this problem would have major impact in 
Artificial Intelligence as we could build machines which learn from experience to perform 
new tasks, similar to what we observe in human behavior. 
An influential line of research on multitask and transfer learning is based on the idea that
the tasks are related by means of a common low dimensional representation, which is learned
jointly with the tasks' parameters. This approach was first advocated in (Baxter, 2000;
Caruana, 1998; Thrun and Pratt, 1998) and more recently reconsidered in (Argyriou et al.,
2008) from the perspective of convex optimization and sparsity regularization. Representation
learning is also a key problem in AI, and in the past years there has been much renewed
interest in learning nonlinear hierarchical representations from multiple tasks using
multilayer, deep networks. Researchers have shown improved results in a number of empirical
domains: the case of computer vision is perhaps most remarkable (see e.g. Girshick et al.,
2014, and references therein). This success has increased interest in multitask representation
learning (MTRL) as it is a core component of deep networks. Still, why this methodology
works remains poorly understood.

In this paper we analyze a general method for MTRL and discuss its potential advantage 
in both the MTL setting, where the learned representation is applied to the same tasks used 
during training, and in the domain of LTL, where the representation is applied to new tasks. 
We derive upper bounds on the error of these methods and quantify their advantage over 
independent task learning. When the original data representation is high dimensional and 
the number of examples provided to solve a regression or classification problem is limited, 
any learning algorithm which does not use any sort of prior knowledge will perform poorly 
because there is not enough data to reliably estimate the model parameters. We make this 
statement precise by considering the example of half space learning. 


1.1 Previous Work 


Many papers have proposed multitask learning methods and studied their applications to
specific problems (see Ando and Zhang, 2005; Argyriou et al., 2008; Baxter, 2000; Ben-David
and Schuller, 2003; Caruana, 1998; Cavallanti et al., 2010; Kuzborskij and Orabona, 2013;
Maurer et al., 2013; Pentina and Lampert, 2014; Widmer et al., 2013, and references
therein). There is a vast literature on these subjects and the list of papers provided here
is necessarily incomplete.

Despite the considerable success of multitask learning and in particular multitask
representation learning there are only a few theoretical investigations (Ando and Zhang, 2005;
Baxter, 2000; Ben-David and Schuller, 2003). Other statistical learning bounds are
restricted to linear multitask learning such as (Cavallanti et al., 2010; Lounici et al., 2011;
Maurer, 2006a).


Learning-to-learn (also called inductive bias learning or transfer learning) has been
proposed by Thrun and Pratt (1998) and theoretically studied by Baxter (2000) where an
error analysis is provided, showing that a common representation which performs well on the
training tasks will also generalize to new tasks obtained from the same “environment”. More
recent papers which present dimension independent bounds appear in Maurer (2006a,b);
Maurer and Pontil (2013); Maurer et al. (2013); Pentina and Lampert (2014).


1.2 Our Contributions 

There are two main contributions of this work. First we present bounds for both the MTL
and LTL settings, which apply to a very general MTRL method. Our analysis goes well
beyond the linear representation learning considered in most previous works. It improves over
the analysis by Baxter (2000) based on covering numbers. We use more recent techniques of
empirical process theory to achieve bounds which are independent of the input dimension
(hence also valid in reproducing kernel Hilbert spaces) and to avoid logarithmic factors.
Furthermore our analysis can be made fully data dependent. When specialized to subspace
learning (i.e. linear feature learning) we get the best bounds valid for infinite dimensional
input spaces.

As the second main contribution of this paper, we explain the advantage of MTRL 
in terms of specificity of feature maps and expose conditions when MTRL is beneficial
or when it is not worth the effort. We further specialize our upper bounds to half-space 
learning (noiseless binary classification) and compare them to a general lower bound for 
learning isolated tasks. We observe that if the number of tasks grows then the performance 
of the method (both in the MTL and LTL setting) matches the performance of square 
norm regularization with best a priori known representation. This analysis highlights the 
advantage of multitask learning over learning the tasks independently. We also present 
numerical experiments for half-space learning, which indicate the good agreement between 
theory and experiments. 


1.3 Organization 

The paper is organized as follows. In Section 2 we introduce the problem and present our
main results. In Section 3 we specialize these results to subspace learning and illustrate
the role played by the data covariance matrices in our bounds. In Section 3.1 we further
illustrate our results in the case of half-space learning, rigorously comparing our upper
bounds to a general lower bound for orthogonal equivariant algorithms. In Section 4 we
present the proof of our main results, developing in particular uniform bounds on the
estimation error. Finally, in Section 5 we summarize our findings and suggest directions for
future research.


2. Multitask Representation Learning 

The set of possible observations is denoted by $\mathcal{Z} = \mathcal{X} \times \mathbb{R}$, where the members of $\mathcal{X}$ are
interpreted as inputs and the members of $\mathbb{R}$ are interpreted as outputs, or labels. A learning
task is modelled by a probability measure $\mu$ on $\mathcal{Z}$ where $\mu(x, y)$ is the probability to
encounter the input-output pair $(x, y) \in \mathcal{Z}$ in the context of task $\mu$. We want to learn how
to predict outputs. If we predict $y$ while the true output is $y'$, we suffer a loss $\ell(y, y')$, where
the loss function $\ell : \mathbb{R} \times \mathbb{R} \to [0,1]$ is assumed to be 1-Lipschitz in the first argument for
every value of the second argument. Different Lipschitz constants can be absorbed in the
scaling of the predictors and different ranges than [0,1] can be handled by a simple scaling
of our results.

If $g$ is a real function defined on $\mathcal{X}$, then the values $g(x)$ can be interpreted as predictors
and the expectation $\mathbb{E}_{(X,Y)\sim\mu}[\ell(g(X), Y)]$ is the risk associated with hypothesis $g$ on the
task $\mu$.

Multitask learning simultaneously considers many tasks $\mu_1, \dots, \mu_T$ and hopes to exploit
some suspected common property of these tasks. For the purpose of this paper this property
is the existence of a representation or common feature-map, which simultaneously simplifies
the learning problem for most, or all of the tasks at hand. We consider predictors $g$ which
factorize

$$g = f \circ h,$$

where “$\circ$” stands for functional composition, that is, $(f \circ h)(x) = f(h(x))$, for every $x \in \mathcal{X}$.
The function $h : \mathcal{X} \to \mathbb{R}^K$ is called the representation, or feature-map, and it is used across
different tasks, while $f$ is a function defined on $\mathbb{R}^K$, a predictor specialized to the task at
hand. In the sequel $K$ will always be the dimension of the representation space.

As usual in learning theory the functions $h : \mathcal{X} \to \mathbb{R}^K$ and $f : \mathbb{R}^K \to \mathbb{R}$ are chosen from
respective hypothesis classes $\mathcal{H}$ and $\mathcal{F}$, which we refer to as the class of representations and
the class of specialized predictors, respectively. These classes can be quite general, but we
require that the functions in $\mathcal{F}$ have Lipschitz constant at most $L$, for some positive real
number $L$.

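The choice of representation and specialized predictors is based on the data observed
for all the tasks. This data takes the form of a multi-sample $\bar{\mathbf{Z}} = (\mathbf{Z}_1, \dots, \mathbf{Z}_T)$, with $\mathbf{Z}_t =
(Z_{t1}, \dots, Z_{tn}) \sim \mu_t^n$. Here and in the sequel an exponent on a measure indicates a product
measure, so that $\mu_t^n$ is a measure on $\mathcal{Z}^n$ and $\mathbf{Z}_t$ is an iid sample of $n$ random variables
distributed as $\mu_t$. We also write $Z_{ti} = (X_{ti}, Y_{ti})$, $\mathbf{Z}_t = (\mathbf{X}_t, \mathbf{Y}_t)$ and $\bar{\mathbf{Z}} = (\bar{\mathbf{X}}, \bar{\mathbf{Y}})$.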

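Multitask representation learning (MTRL) solves the optimization problem

$$\min\left\{ \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} \ell\big(f_t(h(X_{ti})), Y_{ti}\big) : h \in \mathcal{H},\ (f_1, \dots, f_T) \in \mathcal{F}^T \right\}. \tag{1}$$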

In this paper, we are not concerned with the algorithmics of this problem, but rather with
the statistical properties of its solutions $\hat{h}$ and $\hat{f}_1, \dots, \hat{f}_T$. Note that these are functional
random variables in their dependence on $\bar{\mathbf{Z}}$.
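
To make the objective of the MTRL problem concrete, the following minimal Python sketch evaluates the empirical task-averaged loss for one candidate pair consisting of a shared linear representation and task-specific linear predictors. The function names `represent`, `clipped_abs_loss` and `mtrl_objective`, the list-based data layout, and the clipped absolute-error loss are assumptions made purely for illustration; the analysis itself does not prescribe any implementation or optimization procedure.

```python
def represent(D, x):
    """Shared feature map h(x) = (<d_1, x>, ..., <d_K, x>) for a list of atoms D."""
    return [sum(dj * xj for dj, xj in zip(d, x)) for d in D]


def clipped_abs_loss(y_pred, y_true):
    """Illustrative loss: 1-Lipschitz in the prediction, with values in [0, 1]."""
    return min(1.0, abs(y_pred - y_true))


def mtrl_objective(D, W, data, loss=clipped_abs_loss):
    """Average of loss(f_t(h(X_ti)), Y_ti) over all tasks t and examples i.

    D    : list of K atoms defining the shared representation h
    W    : list of T weight vectors in R^K, one linear predictor f_t per task
    data : list of T samples, each a list of pairs (x, y)
    """
    total, count = 0.0, 0
    for w, sample in zip(W, data):
        for x, y in sample:
            features = represent(D, x)  # h(x), shared across tasks
            y_pred = sum(wk * fk for wk, fk in zip(w, features))  # f_t(h(x))
            total += loss(y_pred, y)
            count += 1
    return total / count
```

When every task has the same number $n$ of examples, `count` equals $nT$ and the returned value is the quantity minimized over $h$ and $(f_1, \dots, f_T)$.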

We consider two possible applications of these solutions. One application, which we will
refer to as multitask learning (MTL), retains both the representation $\hat{h}$ and the specializations
$\hat{f}_1, \dots, \hat{f}_T$ to be applied to the tasks at hand. The other, perhaps more important,
application assumes that the tasks are related by a probabilistic law, called an environment,
and keeps only the representation $\hat{h}$ to be used when specializing to new tasks
obeying the same law. In this way the parametrization of a learning algorithm is learned,
hence the name “learning-to-learn” (LTL).

We will give general statistical guarantees in both cases. Our bounds consist of three 
terms. The first term can be interpreted as the cost of estimating the representation h and 
decreases with the number T of tasks available for training. The second term corresponds to 
the cost of estimating task-specific predictors and decreases with the number n of training 
examples available for each task. The last term contains the confidence parameter and 
typically makes only a very small contribution. 

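It is not surprising that the complexity of the representation class $\mathcal{H}$ (first term in
the bounds) plays a central role. We measure this complexity on the observed input data
$\bar{\mathbf{X}} \in \mathcal{X}^{Tn}$. Define a random set $\mathcal{H}(\bar{\mathbf{X}}) \subseteq \mathbb{R}^{KTn}$ by

$$\mathcal{H}(\bar{\mathbf{X}}) = \{ (h_k(X_{ti})) : h \in \mathcal{H} \}.$$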
The complexity measure relevant to estimation of the representation is the Gaussian average

$$G(\mathcal{H}(\bar{\mathbf{X}})) = \mathbb{E}\left[ \sup_{h \in \mathcal{H}} \sum_{kti} \gamma_{kti} h_k(X_{ti}) \,\Big|\, X_{ti} \right], \tag{2}$$

where the $\gamma_{kti}$ are independent standard normal variables. The Gaussian average is of order
$\sqrt{nT}$ in $T$ and $n$ for many classes of interest. These include kernel machines with Lipschitz
kernels (e.g. Gaussian RBF) and arbitrarily deep compositions thereof, see Maurer (2014)
for a discussion. As we shall see, this increase of $O(\sqrt{nT})$ is compensated in our bounds
and the cost of learning the representation vanishes in the multi-task limit $T \to \infty$.

The second term in the bounds is governed by the quantity

$$\sup_{h \in \mathcal{H}} \frac{1}{n\sqrt{T}} \|h(\bar{\mathbf{X}})\| = \sup_{h \in \mathcal{H}} \frac{1}{\sqrt{n}} \sqrt{\frac{1}{nT} \sum_{kti} h_k(X_{ti})^2} \tag{3}$$

or an equivalent distribution-dependent expression. If the feature-maps in $\mathcal{H}$ are very
specific, in the sense that their components are appreciably different from zero only for very
special data, the quantity in (3) can become much smaller than $1/\sqrt{n}$, a phenomenon which
can give a considerable competitive edge to MTRL, in particular if the per-task sample size
$n$ is small. We will demonstrate this in Section 3 where we apply Theorems 1 and 2 to
subspace-learning and show that the above quantity is related to the operator norm of the
data covariance.
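
For a finite collection of candidate representations, a Gaussian average of the kind used above can be approximated by simulation. The sketch below is a hedged illustration only: the function name `gaussian_average`, the encoding of each representation by the flattened vector of its values on the data, and the number of Monte Carlo draws are assumptions introduced here. For the infinite classes treated in this paper the Gaussian average is bounded analytically rather than sampled.

```python
import random


def gaussian_average(vectors, num_draws=2000, seed=0):
    """Monte Carlo estimate of E sup_v <gamma, v> over a finite set of vectors.

    vectors : list of equal-length lists; each list holds the values
              (h_k(X_ti)) of one representation h on the observed inputs,
              flattened over the indices k, t, i.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    total = 0.0
    for _ in range(num_draws):
        # One draw of independent standard normal variables gamma_kti.
        gamma = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Supremum over the (finite) class of the Gaussian linear form.
        total += max(sum(g * v for g, v in zip(gamma, vec)) for vec in vectors)
    return total / num_draws
```

For the two-element set $\{v, -v\}$ the exact value is $\|v\|\sqrt{2/\pi}$, which gives a simple sanity check for the estimator.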


2.1 Bounding the Excess Task-averaged Risk (MTL) 

If we make no further assumptions on the generation of the task-measures $\mu_1, \dots, \mu_T$, a
conceptually simple performance measure for a representation $h$ and specialized predictors
$f_1, \dots, f_T$ is the task-averaged risk

$$\mathcal{E}_{\mathrm{avg}}(h, f_1, \dots, f_T) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(X,Y)\sim\mu_t}\, \ell\big(f_t(h(X)), Y\big).$$

We want to compare this to the very best we can do using the classes $\mathcal{H}$ and $\mathcal{F}$, given
complete knowledge of the distributions $\mu_1, \dots, \mu_T$. The minimal risk is clearly

$$\mathcal{E}^*_{\mathrm{avg}} = \min_{h \in \mathcal{H},\ (f_1, \dots, f_T) \in \mathcal{F}^T} \mathcal{E}_{\mathrm{avg}}(h, f_1, \dots, f_T).$$

It is a fundamental hope underlying our approach that the classes $\mathcal{H}$ and $\mathcal{F}$ are large enough
for this quantity to be sufficiently small for practical purposes. We use the words “hope”
and “belief” because an “assumption” would imply a statement to be used in analytical
reasoning. Instead our approach is agnostic, and our results are valid independent of the
size of the minimal risk above.

Our first result bounds the excess average risk, which measures the difference between
the task-averaged true risk of the solutions to (1) and the theoretical optimum above.
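Theorem 1 Let $\mu_1, \dots, \mu_T$, $\mathcal{H}$ and $\mathcal{F}$ be as above, and assume $0 \in \mathcal{H}$ and $f(0) = 0$ for all
$f \in \mathcal{F}$. Then for $\delta > 0$ with probability at least $1 - \delta$ in the draw of $\bar{\mathbf{Z}} \sim \prod_{t=1}^{T} \mu_t^n$ we have
that

$$\mathcal{E}_{\mathrm{avg}}(\hat{h}, \hat{f}_1, \dots, \hat{f}_T) - \mathcal{E}^*_{\mathrm{avg}} \le c_1 \frac{L\, G(\mathcal{H}(\bar{\mathbf{X}}))}{nT} + c_2 \frac{Q \sup_{h \in \mathcal{H}} \|h(\bar{\mathbf{X}})\|}{n\sqrt{T}} + \sqrt{\frac{8 \ln(4/\delta)}{nT}},$$

where $c_1$ and $c_2$ are universal constants, $G(\mathcal{H}(\bar{\mathbf{X}}))$ is the Gaussian average in Equation (2),
and $Q$ is the quantity

$$Q \equiv Q(\mathcal{F}) = \sup_{y \neq y' \in \mathbb{R}^{Kn}} \frac{1}{\|y - y'\|}\, \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \gamma_i \big(f(y_i) - f(y'_i)\big). \tag{4}$$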

Remarks:

1. The assumptions $0 \in \mathcal{H}$ and $f(0) = 0$ for all $f \in \mathcal{F}$ are made to give the result a
simpler appearance. They are not essential, as the reader can verify from the proof.

2. If $G(\mathcal{H}(\bar{\mathbf{X}}))$ is of order $\sqrt{nT}$ then the first term on the right hand side above is of
order $1/\sqrt{nT}$ and vanishes in the multi-task limit $T \to \infty$ even for small values of $n$.

3. For reasonable classes $\mathcal{F}$ one can find a bound on $Q$, which is independent of $n$,
because the $\|y - y'\|$ in the denominator balances the Gaussian average depending on
the class $\mathcal{F}$.

4. The quantity $\sup_{h \in \mathcal{H}} \|h(\bar{\mathbf{X}})\|$ is of order $\sqrt{nT}$ whenever $\mathcal{H}$ is uniformly bounded, a crude
bound being $\sqrt{nT} \sup_{h \in \mathcal{H}} \max_{ti} \|h(X_{ti})\|$. The second term is thus typically of order
$1/\sqrt{n}$. As explained in the discussion of Equation (3) above it can be very small if
the representation components in $\mathcal{H}$ are very data-specific.
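
The orders stated in Remarks 2 and 4 can be checked by direct substitution into the three terms of Theorem 1. The following worked computation is only a sketch: the constants $a$ and $b$ are hypothetical, introduced here to stand for the assumed bounds $G(\mathcal{H}(\bar{\mathbf{X}})) \le a\sqrt{nT}$ and $\sup_{h \in \mathcal{H}} \|h(\bar{\mathbf{X}})\| \le b\sqrt{nT}$.

```latex
\[
c_1 \frac{L\, a\sqrt{nT}}{nT} = \frac{c_1 L a}{\sqrt{nT}},
\qquad
c_2 \frac{Q\, b\sqrt{nT}}{n\sqrt{T}} = \frac{c_2 Q b}{\sqrt{n}},
\qquad
\sqrt{\frac{8\ln(4/\delta)}{nT}} .
\]
```

With $n$ fixed and $T \to \infty$ the first and third expressions vanish, so under these assumptions only the middle term, of order $1/\sqrt{n}$, remains.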


2.2 Bounding the Excess Risk for Learning-to-learn (LTL) 

Now we consider the case where we only retain the representation $\hat{h}$ obtained from (1) and
specialize it to future, hitherto unknown tasks. This is of course only possible if there is
some common law underlying the generation of tasks. Following Baxter (2000) we suppose
that the tasks originate in a common environment $\eta$, which is by definition a probability
measure on the set of probability measures on $\mathcal{Z}$. The draw of $\mu \sim \eta$ models the encounter
of a learning task $\mu$ in the environment $\eta$.

The environment $\eta$ induces a measure $\mu_\eta$ on $\mathcal{Z}$ by

$$\mu_\eta(A) = \mathbb{E}_{\mu \sim \eta}[\mu(A)] \quad \text{for } A \subseteq \mathcal{Z}.$$

This simple mixture plays an important role in the interpretation of our results. 

The measure $\eta$ also induces a measure $\rho_\eta$ on $\mathcal{Z}^n$ which corresponds to the draw of an
$n$-sample from a random task in the environment. To draw a sample $\mathbf{Z} \in \mathcal{Z}^n$ from $\rho_\eta$ we first
draw a task $\mu$ from $\eta$ and then generate the sample $\mathbf{Z} = (Z_1, \dots, Z_n)$ from $n$ independent
draws from $\mu$. Formally

$$\rho_\eta(A) = \mathbb{E}_{\mu \sim \eta}[\mu^n(A)] \quad \text{for } A \subseteq \mathcal{Z}^n.$$

We assume that the tasks $\mu_1, \dots, \mu_T$ are drawn independently from $\eta$ and, consequently,
that the multisample $\bar{\mathbf{Z}} = (\mathbf{Z}_1, \dots, \mathbf{Z}_T)$ is obtained in $T$ independent draws from $\rho_\eta$, that
is, $\bar{\mathbf{Z}} \sim \rho_\eta^T$.

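The way we plan to use a representation $h \in \mathcal{H}$ on a new task $\mu \sim \eta$ is as follows: we
draw a training sample $\mathbf{Z} = (Z_1, \dots, Z_n)$ from $\mu^n$ and solve the optimization problem

$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(h(X_i)), Y_i\big).$$

Let $\hat{f}_{h,\mathbf{Z}}$ denote the minimizer and $m_{h,\mathbf{Z}}$ the corresponding minimum. We will then use the
hypothesis $a(h)_{\mathbf{Z}} = \hat{f}_{h,\mathbf{Z}} \circ h = \hat{f}_{h,\mathbf{Z}}(h(\cdot))$ for the new task. In this way any representation
$h \in \mathcal{H}$ parametrizes a learning algorithm, which is a function $a(h) : \mathcal{Z}^n \to \mathcal{F} \circ h$, defined,
for every $\mathbf{Z} \in \mathcal{Z}^n$, as

$$a(h)_{\mathbf{Z}} = \hat{f}_{h,\mathbf{Z}} \circ h.$$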


In this sense the problem of optimizing such a representation can properly be called
“learning-to-learn”. It can also be interpreted as “learning a hypothesis space” as in (Baxter,
2000), namely selecting a hypothesis space $\mathcal{F} \circ h$ from the collection of hypothesis spaces
$\{\mathcal{F} \circ h : h \in \mathcal{H}\}$.

We can test the algorithm $a(h)$ on the environment $\eta$ in the following way:

• we draw a task $\mu \sim \eta$,

• we draw a sample $\mathbf{Z} \in \mathcal{Z}^n$ from $\mu^n$,

• we run the algorithm to obtain $a(h)_{\mathbf{Z}} = \hat{f}_{h,\mathbf{Z}} \circ h$,

• finally, we measure the loss of $a(h)_{\mathbf{Z}}$ on a random data-point $Z = (X, Y) \sim \mu$.
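
The four steps above can be simulated directly when the environment can be sampled. The following Python sketch is an illustration under stated assumptions: the name `ltl_risk_estimate` and the callback interface (`draw_task`, `fit`, `loss`) are hypothetical, and the representation is taken to be fixed inside `fit`, which is assumed to perform the empirical risk minimization over the specialized predictors.

```python
import random


def ltl_risk_estimate(draw_task, fit, loss, n, num_tasks=200, seed=0):
    """Monte Carlo estimate of the expected loss of a learning algorithm.

    draw_task(rng)       -> sampler; sampler(rng) returns one pair (x, y)
    fit(sample)          -> predictor g (for a fixed representation h,
                            g plays the role of f_hat composed with h)
    loss(y_pred, y_true) -> value in [0, 1]
    n                    -> training sample size per task
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_tasks):
        sampler = draw_task(rng)                    # draw a task from the environment
        sample = [sampler(rng) for _ in range(n)]   # draw an n-sample from that task
        g = fit(sample)                             # run the algorithm on the sample
        x, y = sampler(rng)                         # fresh data point from the same task
        total += loss(g(x), y)
    return total / num_tasks
```

Averaging over many simulated tasks replaces the random draws by empirical means, which mirrors how the risk of the algorithm is obtained by replacing the draws with expectations.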

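To define the risk $\mathcal{E}_\eta(h)$ associated with the algorithm $a(h)$ parametrized by $h$ we just
replace all random draws with corresponding expectations, so

$$\mathcal{E}_\eta(h) = \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{\mathbf{Z} \sim \mu^n}\, \mathbb{E}_{(X,Y) \sim \mu} \big[\ell\big(a(h)_{\mathbf{Z}}(X), Y\big)\big].$$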


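The best value for any representation $h$ in $a(h)$, given complete knowledge of the environment,
is then

$$\min_{h \in \mathcal{H}} \mathcal{E}_\eta(h).$$

But, given complete knowledge of the environment, this is still not the best we can do
using the classes $\mathcal{F}$ and $\mathcal{H}$, because for given $\mu$ and $h$ we still use the expected performance
$\mathbb{E}_{\mathbf{Z} \sim \mu^n} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_{\mathbf{Z}}(X), Y)$ of the empirical risk minimization algorithm $a(h)$, instead of
using knowledge of $\mu$ to replace it by $\min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h(X)), Y)$. The very best we can
do is thus

$$\mathcal{E}^*_\eta = \min_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \eta} \left[ \min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim \mu}\, \ell\big(f(h(X)), Y\big) \right].$$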
The excess risk associated with any representation $h$ is thus

$$\mathcal{E}_\eta(h) - \mathcal{E}^*_\eta.$$

We give the following bound for the excess risk associated with the representation $\hat{h}$ found
as solution to the optimization problem (1).

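Theorem 2 Let $\eta$ be an environment on $\mathcal{Z}$ and $\mathcal{H}$ and $\mathcal{F}$ as above. Then: (i) with probability
at least $1 - \delta$ in the draw of $\bar{\mathbf{Z}} \sim \rho_\eta^T$

$$\mathcal{E}_\eta(\hat{h}) - \mathcal{E}^*_\eta \le \frac{\sqrt{2\pi}\, L\, G(\mathcal{H}(\bar{\mathbf{X}}))}{T\sqrt{n}} + \sqrt{2\pi}\, Q' \sup_{h \in \mathcal{H}} \sqrt{\frac{\mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2}{n}} + \sqrt{\frac{8 \ln(4/\delta)}{T}}$$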

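and (ii) with the same probability

$$\mathcal{E}_\eta(\hat{h}) - \mathcal{E}^*_\eta \le \frac{\sqrt{2\pi}\, L\, G(\mathcal{H}(\bar{\mathbf{X}}))}{T\sqrt{n}} + \frac{\sqrt{2\pi}\, Q'\, (1/T) \sum_t \sup_{h \in \mathcal{H}} \|h(\mathbf{X}_t)\|}{n} + 9 \sqrt{\frac{\ln(8/\delta)}{T}},$$

where $\hat{h}$ is solution to the problem (1), $G(\mathcal{H}(\bar{\mathbf{X}}))$ is the Gaussian average introduced in
(2), and $Q'$ is the quantity

$$Q' \equiv Q'(\mathcal{F}) = \sup_{y \in \mathbb{R}^{Kn} \setminus \{0\}} \frac{1}{\|y\|}\, \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \gamma_i f(y_i). \tag{5}$$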


We make some remarks and comparisons to the previous result.

1. The constants are now explicit and small. For Theorem 1 uniform estimation had
to be controlled simultaneously in $\mathcal{H}$ and $\mathcal{F}$, while for LTL the problem can be more
easily decoupled.

2. The first term is equivalent to the first term in Theorem 1 except for $\sqrt{n}$ replacing $n$
in the denominator. It is therefore typically of order $1/\sqrt{T}$ instead of $1/\sqrt{nT}$. The
different order is due to the estimation of a hitherto unknown task, for which the
sample sizes are irrelevant. To understand this point assume that $\eta$ has the property
that every $\mu \sim \eta$ is deterministic, that is supported on a single point $z_\mu \in \mathcal{Z}$. Then
clearly the sample size $n$ is irrelevant, and the problem becomes equivalent to learning
a single task with a sample of size $T$.

3. The quantity $Q'$ is very much like the quantity $Q$ in Equation (4), and it is uniformly
bounded in $n$ for the classes we consider. For linear classes $Q = Q'$.

4. The bound in part (i) is not fully data-dependent, but more convenient for our
applications below. The quantity

$$\sup_{h \in \mathcal{H}} \sqrt{\mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2} = \sup_{h \in \mathcal{H}} \sqrt{\sum_k \mathbb{E}_{(X,Y) \sim \mu_\eta} h_k(X)^2}$$

plays a similar role to (3), which is its empirical counterpart. Again, if the features
are very specific, as the dictionary atoms of the next section or the atoms in a radial
basis function network, then the above quantity can become very small.


2.3 Comparison to Previous Bounds 


The first and most important theoretical study of MTL and LTL was carried out by Baxter
(2000), where sample complexity bounds are given for both settings. Instead of a feature
map a hypothesis space is selected from a class of hypothesis spaces. Clearly every feature
map with values in $\mathbb{R}^K$ defines a hypothesis space while the reverse is not true in general,
so Baxter's setting is certainly more general than ours. On the other hand the practical
applications discussed in (Baxter, 2000) can be cast in the language of feature learning.

To prove his sample complexity bounds Baxter uses covering numbers. This classical
method requires covering a (meta-)hypothesis space (or its evaluation on a sample) with a
set of balls in an appropriately chosen metric. The uniform bound is then obtained as a
union bound over the cover and bounds valid on the individual balls. The latter bounds
follow from Lipschitz properties $L$ of the loss function relative to the chosen metric. For
a bound of order $\epsilon$ the radius of the balls has to be of order $\epsilon/L$. This leads to covering
numbers of order $\epsilon^{-d}$, where $d$ is some exponent (see the last inequalities in the corresponding
proof in (Baxter, 2000)), and has the consequence that the dominant term in the bound has an
additional factor of $\ln(1/\epsilon)$. This is manifest in Theorem 8, Theorem 12 and Corollary 13
in (Baxter, 2000) and constitutes an essential weakness of the method of covering numbers.

For bounds on the excess risk it implies that the orders of $\sqrt{1/T}$ and $\sqrt{1/n}$ obtained from
Rademacher or Gaussian complexities have to be replaced by $\sqrt{\ln(T)/T}$ and $\sqrt{\ln(n)/n}$.

Rademacher and Gaussian complexities make it easy to handle infinite dimensional input
spaces (see our Theorems 4 and 5 below). They also lead to data dependent bounds, which
allows us to explain the benefits of multi-task learning in terms of the spectrum of the data
covariance operator and the effective input dimension. Bounding Gaussian complexities for
linear classes is comparatively simple, see the proof of our Lemma 3. There is a wealth
of recent literature on the Rademacher complexity of matrices with spectral regularizers
(see e.g. Kakade et al., 2012; Maurer and Pontil, 2013, and references therein), while it is
unclear to us how Baxter's method could be applied if the feature map is constrained by a
bound on, say, the trace norm of the associated matrix. In the case of LTL, our approach
also leads to explicit and small constant factors.

On the other hand it must be admitted that it is relatively easy to obtain bounds (also
provided by Baxter) of order $\ln(n)/n$ or $\ln(T)/T$ with covering numbers in the realizable
case. Such bounds would be more difficult to obtain with our techniques.


The work of Ando and Zhang (2005) proposes the use of MTL as a method of semi-supervised
learning through the creation of artificial tasks from unlabelled data, for example
predicting concealed components of vectors. They analyze a specific algorithm where the
class of feature maps can be seen as a linear mixture of a fixed feature map with subspace
projections as discussed in our paper. The bounds given apply to the task-averaged risk
and not to LTL. The analysis is based on Rademacher averages and is independent of the
input dimension. The bound itself is expressed as an entropy integral as given by Dudley
(see e.g. Van Der Vaart and Wellner, 1996) but it is not very explicit. In particular the role
of the spectrum of the data covariance is not apparent.


3. Multi-task Subspace Learning 

We illustrate the general results of the previous section with an important special case. We
assume that the input space $\mathcal{X}$ is a bounded subset of a Hilbert space $H$, which could for
example be a reproducing kernel Hilbert space. We denote by $\langle \cdot, \cdot \rangle$ the inner product in $H$
and by $\|\cdot\|$ the induced norm. We hope that sufficiently good results can be obtained by
predictors of the form $g$, where $g : H \to \mathbb{R}$ is linear with bounded norm. We also suspect
that only few linear features in $H$ suffice for most tasks, so that the vectors defining the
hypotheses $g$ can all be chosen from one and the same, albeit unknown, $K$-dimensional
subspace $M$ of $H$.

Consequently we will factorize predictors as $f \circ h$, where $h$ is a partial isometry $h :
H \to \mathbb{R}^K$ and $f$ is a linear functional on $\mathbb{R}^K$ chosen from some ball of bounded radius.

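Specifically, we introduce the classes

$$\mathcal{H} = \left\{ x \in H \mapsto (\langle d_1, x \rangle, \dots, \langle d_K, x \rangle) \in \mathbb{R}^K : D = (d_1, \dots, d_K) \in H^K \text{ orthonormal} \right\},$$

$$\mathcal{F} = \left\{ y \in \mathbb{R}^K \mapsto \sum_k w_k y_k \in \mathbb{R} : \sum_k w_k^2 \le B^2 \right\}.$$

The $D$'s appearing in the definition of $\mathcal{H}$ are also called dictionaries and the individual $d_k$
are called atoms (see Maurer et al., 2013).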

It does no harm to our analysis if we immediately generalize the class $\mathcal{H}$ so as to
include certain two-layer neural networks by allowing a nonlinear activation function $\phi$
with Lipschitz constant $L_\phi$ and satisfying $\phi(0) = 0$, to be applied with each atom. We can
also drop the condition of orthonormality and allow the atoms to trade some of their norms
when needed. The enlarged class of representations is

$$\mathcal{H} = \left\{ x \in H \mapsto (\phi(\langle d_1, x \rangle), \dots, \phi(\langle d_K, x \rangle)) \in \mathbb{R}^K : d_1, \dots, d_K \in H,\ \sum_k \|d_k\|^2 \le K \right\}.$$

The results can then be re-specialized to subspace learning by setting $\phi$ to the identity and
$L_\phi$ to one.

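When applied to subspace learning, our bounds are expressed in terms of covariances.
If $\nu$ is a probability measure on $H$ the corresponding covariance operator $C_\nu$ is defined by

$$\langle C_\nu v, w \rangle = \mathbb{E}_{X \sim \nu} \langle v, X \rangle \langle X, w \rangle \quad \text{for } v, w \in H.$$

For an environment $\eta$ we denote the covariance operator corresponding to the data-marginal
of the mixture measure $\mu_\eta$ simply by $C$.

If $\mathbf{x} = (x_1, \dots, x_m) \in H^m$ we define the empirical covariance operator $\hat{C}(\mathbf{x})$ by

$$\langle \hat{C}(\mathbf{x}) v, w \rangle = \frac{1}{m} \sum_i \langle v, x_i \rangle \langle x_i, w \rangle \quad \text{for } v, w \in H,$$

in particular

$$\langle \hat{C}(\bar{\mathbf{X}}) v, w \rangle = \frac{1}{nT} \sum_{ti} \langle v, X_{ti} \rangle \langle X_{ti}, w \rangle.$$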

ti 

The following lemma establishes the necessary ingredients for the application of Theorems
1 and 2 to the case of subspace learning. Recall that if $A$ is a selfadjoint positive linear
operator on $H$, we denote by $\|A\|_\infty$ and $\|A\|_1$ its spectral and trace norms, respectively.
They are defined as $\|A\|_\infty = \sup_{\|z\| \le 1} \|Az\|$ and $\|A\|_1 = \sum_i \langle A e_i, e_i \rangle$, where $\{e_i\}_{i \in \mathbb{N}}$ is an
orthonormal basis in $H$. Recall also the definition of $Q(\mathcal{F})$ and $Q'(\mathcal{F})$ given in Equations
(4) and (5), respectively.


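Lemma 3 Let $\bar{\mathbf{x}} = (x_{ti})$ be a $T \times n$ matrix with values in a Hilbert space and let $\phi$, $\mathcal{H}$ and
$\mathcal{F}$ be defined as above. Then

(i) $G(\mathcal{H}(\bar{\mathbf{x}})) \le L_\phi K \sqrt{nT\, \|\hat{C}(\bar{\mathbf{x}})\|_1}$.

(ii) For every $h \in \mathcal{H}$, $\|h(\bar{\mathbf{x}})\| \le L_\phi \sqrt{K nT\, \|\hat{C}(\bar{\mathbf{x}})\|_\infty}$.

(iii) For an environment $\eta$ and every $h \in \mathcal{H}$,
$\mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2 \le L_\phi^2 K \|C\|_\infty$.

(iv) $L(\mathcal{F}) \le B$.

(v) $Q(\mathcal{F}) \le B$ and $Q'(\mathcal{F}) \le B$.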

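Proof (i) Using the contraction lemma, Corollary 11, in the first inequality and Cauchy-Schwarz
and Jensen's inequality in the second we get

$$G(\mathcal{H}(\bar{\mathbf{x}})) \le L_\phi\, \mathbb{E} \sup_{D} \sum_{kti} \gamma_{kti} \langle d_k, x_{ti} \rangle = L_\phi\, \mathbb{E} \sup_{D} \sum_k \Big\langle d_k, \sum_{ti} \gamma_{kti} x_{ti} \Big\rangle$$

$$\le L_\phi \sqrt{K} \Big( \sum_k \mathbb{E} \Big\| \sum_{ti} \gamma_{kti} x_{ti} \Big\|^2 \Big)^{1/2} = L_\phi K \Big( \sum_{ti} \|x_{ti}\|^2 \Big)^{1/2} = L_\phi K \sqrt{nT\, \|\hat{C}(\bar{\mathbf{x}})\|_1}.$$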


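(ii) For any $D \in \mathcal{H}$

$$\sum_{kti} \phi(\langle d_k, x_{ti} \rangle)^2 \le L_\phi^2 \sum_{kti} \langle d_k, x_{ti} \rangle^2 \le L_\phi^2 K \sup_{\|v\| \le 1} \sum_{ti} \langle v, x_{ti} \rangle^2 = L_\phi^2 K nT\, \|\hat{C}(\bar{\mathbf{x}})\|_\infty,$$

where we used $\phi(0) = 0$ in the first step.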
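(iii) Similarly, we have that

$$\mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2 \le L_\phi^2 K \sup_{\|v\| \le 1} \mathbb{E}_{(X,Y) \sim \mu_\eta} \langle v, X \rangle^2 = L_\phi^2 K \|C\|_\infty.$$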

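(iv) Let $y, y' \in \mathbb{R}^K$. Then

$$\sup_{f \in \mathcal{F}} |f(y) - f(y')| = \sup_{\|w\| \le B} \Big| \sum_k w_k (y_k - y'_k) \Big| \le B \|y - y'\|,$$

so $L \le B$.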

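(v) Similarly, we have that

$$\mathbb{E} \sup_{\|w\| \le B} \sum_i \gamma_i \Big( \sum_k w_k y_{ki} - \sum_k w_k y'_{ki} \Big) = \mathbb{E} \sup_{\|w\| \le B} \sum_k w_k \sum_i \gamma_i (y_{ki} - y'_{ki}) \le B \Big( \sum_{ki} (y_{ki} - y'_{ki})^2 \Big)^{1/2} = B \|y - y'\|,$$

so $Q \le B$. The same proof works for $Q'$.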


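Substitution in Theorem 1 immediately gives

Theorem 4 (subspace MTL) With probability at least $1 - \delta$ in $\bar{\mathbf{X}}$ the excess risk is
bounded by

$$\mathcal{E}_{\mathrm{avg}} - \mathcal{E}^*_{\mathrm{avg}} \le c_1 L_\phi B K \sqrt{\frac{\|\hat{C}(\bar{\mathbf{X}})\|_1}{nT}} + c_2 L_\phi B \sqrt{\frac{K \|\hat{C}(\bar{\mathbf{X}})\|_\infty}{n}} + \sqrt{\frac{8 \ln(2/\delta)}{nT}}. \tag{6}$$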


We remark that in the linear case the best competing bound for MTL, obtained by
Maurer and Pontil (2013) from noncommutative Bernstein inequalities, is

$$2B \sqrt{\frac{K \|\hat{C}(\bar{\mathbf{X}})\|_1 \ln(Tn)}{nT}} + B \sqrt{\frac{8K \|\hat{C}(\bar{\mathbf{X}})\|_\infty}{n}} + \sqrt{\frac{8 \ln(2/\delta)}{nT}}. \tag{7}$$

If we disregard the constants this is worse than the bound (6) whenever $K < \ln(Tn)$. Its
approach to the multitask limit is slower ($\sqrt{\ln(T)/T}$ as opposed to $\sqrt{1/T}$), but of course
it has the advantage of smaller constants. The methods used to obtain (7), however, break
down for nonlinear dictionaries.

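For the LTL setting, we use the distribution dependent bound, Theorem 2 (i), and
obtain

Theorem 5 (subspace LTL) With probability at least $1 - \delta$ in $\bar{\mathbf{X}}$ the excess risk is
bounded by

$$\mathcal{E}_\eta(\hat{h}) - \mathcal{E}^*_\eta \le \sqrt{2\pi}\, L_\phi B \left( K \sqrt{\frac{\|\hat{C}(\bar{\mathbf{X}})\|_1}{T}} + \sqrt{\frac{K \|C\|_\infty}{n}} \right) + \sqrt{\frac{8 \ln(4/\delta)}{T}}.$$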

The two most important common features of Theorems 4 and 5 are the decay to zero
of the first term, as $T \to \infty$, and the occurrence of the operator norm of the empirical or
true covariances in the second term. The first implies that for very large numbers of tasks
the bounds are dominated by the second term.

To understand the second term we must first realize that the ratio of trace and operator
norms of the true covariances can be interpreted as an effective dimension of the distribution.
This is easily seen if the mixture of task-marginals is concentrated and uniform on a
$d$-dimensional unit-sphere. In this case $\|C\|_1 = 1$ and by isotropy all eigenvalues are equal,
so $\|C\|_\infty = 1/d$, whence $\|C\|_1 / \|C\|_\infty = d$. In such a case the second term in Theorem 5
above becomes

$$\sqrt{2\pi}\, L_\phi B \sqrt{\frac{K}{dn}}. \tag{8}$$


The appropriate standard bound for learning the tasks independently would be $B\sqrt{1/n}$
(see Bartlett and Mendelson, 2002). The ratio $\sqrt{K/d}$ of the two bounds in the multitask
limit is the quotient of utilized information (the dimension of the representation space) to
available information (the dimension of the data). This highlights the potential advantages
of MTRL: if the data is already low-dimensional in the order of $K$ then multi-task learning
isn't worth the extra computational labour. If the data is high dimensional however, then
multi-task learning may be superior.
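
The effective dimension and the ratio of the two bounds discussed above are simple to compute once the spectrum of the covariance is known. The following Python sketch is illustrative only: the function names `effective_dimension` and `bound_ratio` are hypothetical, and the covariance is assumed to be given through its list of nonnegative eigenvalues.

```python
def effective_dimension(eigenvalues):
    """Ratio of trace norm to operator norm of a positive operator,
    computed from its nonnegative eigenvalues."""
    top = max(eigenvalues)
    if top <= 0:
        raise ValueError("the covariance must have a positive eigenvalue")
    return sum(eigenvalues) / top


def bound_ratio(K, d):
    """Ratio sqrt(K/d) between the multitask-limit term and the
    independent-task bound, ignoring constant factors."""
    return (K / d) ** 0.5
```

For the uniform distribution on a $d$-dimensional unit sphere all $d$ eigenvalues equal $1/d$, so `effective_dimension` returns $d$, and `bound_ratio` is small exactly when $K$ is much smaller than $d$.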

The expression (8) above might suggest that there really is a benefit of high dimensions
for learning-to-learn. This is of course not the case, because the regularizer $B$ has to be
chosen large, in fact proportional to $\sqrt{d}$ to allow a small empirical error. The correct
interpretation of (8) is that the burden of high dimensions vanishes in the limit $T \to \infty$. In
the next section we will explain this point in more detail.


3.1 Learning to Learn Half-spaces 

In this section, we illustrate the benefit of MTRL over independent task learning (ITL) in 
the case of noiseless linear binary classification (or half-space learning). We compare our 
upper bounds for LTL to a general lower bound on the performance of ITL algorithms and 
quantify the parameter regimes where LTL is superior to ITL. 

We assume that all the input marginals are given by the uniform distribution $\sigma$ on the
unit sphere $S_d$ in $\mathbb{R}^d$, and the objective is for each task $\mu$ to classify membership in the
half-space $\{x : \langle x, u_\mu\rangle > 0\}$ defined by a task-specific (unknown) unit vector $u_\mu$. In the
given environment all the vectors $u_\mu$ are assumed to lie in some (unknown) $K$-dimensional
subspace $M$ of $\mathbb{R}^d$. We are interested in the regime that

$$K \ll n \ll d$$


and $T$ grows. This is the safe regime in which our upper bounds for MTL or LTL (cf.
Theorems 4 and 5) are smaller than a uniform lower bound for independent task learning,
which we discuss below. We need $n \ll d$ for the lower bound to be large and $K \ll n$ for
the middle term in our upper bounds to be small. If $T$ is large enough, the second term in
our upper bounds dominates the first (task dependent) term. A safe choice is $T \gg K^2 d$,
see Equation (9) below.

The 0-1-loss is unsuited for our bounds because it is not Lipschitz. Instead we will use
the truncated hinge loss with unit margin given by $\ell(y', y) = \xi(y'y)$, where $\xi$ is the real
function

$$\xi(t) = \begin{cases} 1 & \text{if } t \le 0, \\ 1 - t & \text{if } 0 < t \le 1, \\ 0 & \text{if } 1 < t. \end{cases}$$

This loss is an upper bound of the 0-1-loss, so upper bounds for this loss function are also 
upper bounds for the classification error. 
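
The domination of the 0-1-loss by the truncated hinge loss can be checked directly. The sketch below is an illustration only: the function names are hypothetical, and the 0-1-loss is taken to count a margin $t = y'y \le 0$ as an error.

```python
import numpy as np

def truncated_hinge(t):
    """xi(t) = 1 for t <= 0, 1 - t for 0 < t <= 1, 0 for t > 1."""
    return np.clip(1.0 - np.asarray(t, dtype=float), 0.0, 1.0)

def zero_one(t):
    """0-1 loss as a function of the margin t = y' * y (error when t <= 0)."""
    return (np.asarray(t) <= 0).astype(float)

# Evaluate both losses on a grid of margins.
margins = np.linspace(-3.0, 3.0, 601)
dominates = bool(np.all(truncated_hinge(margins) >= zero_one(margins)))
print("truncated hinge >= 0-1 loss on the grid:", dominates)
```

The function is 1-Lipschitz in the margin, which is the property the bounds require and the 0-1-loss lacks.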

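Let $\mathcal{H}$ and $\mathcal{F}$ be as given at the beginning of Section 3 in its linear variant, where $\mathcal{H}$ is
defined by orthonormal dictionaries without activation functions. Thus, $\mathcal{H}$ can be viewed
as the set of partial isometries $D : H \to \mathbb{R}^K$.

Recall the definition of the minimal risk for LTL

$$\mathcal{E}^*_\eta = \min_{h \in \mathcal{H}} \mathbb{E}_{\mu\sim\eta} \min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y)\sim\mu}\, \ell(f(h(X)), Y) = \min_{D \in \mathcal{H}} \mathbb{E}_{\mu\sim\eta} \min_{\|w\| \le B} \mathbb{E}_{X\sim\sigma}\, \xi\big(\langle w, DX\rangle\, \mathrm{sgn}(\langle u_\mu, X\rangle)\big).$$

Let $D_M$ be the partial isometry mapping $M$ onto $\mathbb{R}^K$. Then $D_M \in \mathcal{H}$ and for every unit
vector $u \in H$ we have $D_M(Bu) \in \mathcal{F}$. Thus

$$\mathcal{E}^*_\eta \le \mathbb{E}_{\mu\sim\eta}\big[\mathbb{E}_{X\sim\sigma}\, \xi\big(\langle D_M(Bu_\mu), D_M X\rangle\, \mathrm{sgn}(\langle u_\mu, X\rangle)\big)\big]
= \mathbb{E}_{\mu\sim\eta}\big[\mathbb{E}_{X\sim\sigma}\, \xi\big(B\,|\langle u_\mu, X\rangle|\big)\big]
\le \sup_{u} \mathbb{E}_{X\sim\sigma}\, \xi\big(B\,|\langle u, X\rangle|\big).$$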


For any unit vector $u \in H$ the density of the distribution of $|\langle u, X\rangle|$ under $\sigma$ has maximum
$A_{d-1}/A_d$, where $A_d$ is the volume of $S_d$ in the metric inherited from $\mathbb{R}^d$. This density can
therefore be bounded by $\sqrt{d}/2$. Thus

$$\mathcal{E}^*_\eta \le \frac{\sqrt{d}}{2} \int_{-\infty}^{\infty} \xi(B|s|)\, ds = \frac{\sqrt{d}}{2B} = \epsilon,$$

if we set $B = \sqrt{d}/(2\epsilon)$. This choice is made to ensure that the Lipschitz loss upper bounds
the 0-1-loss.

Now let $\mathbf{Z}$ be a multi-sample generated from the environment $\eta$ and assume that we have
solved the optimization problem (1) to obtain the representation (or feature-map) $\hat D \in \mathcal{H}$.
Using the excess risk bound, Theorem 5, and the fact that $\|C\|_\infty = 1/d$ and $\|\hat C\|_1 = 1$, we
get with probability at least $1 - \delta$ in the draw of $\mathbf{Z}$, that


£r,{b) 



(9) 


if we optimize $\epsilon$. This guarantees the expected performance of future uses of the
representation $\hat D$. The high dimension still is a hindrance to the estimation of the representation,
but, as announced, its effect vanishes in the limit $T \to \infty$. The individual samples need
only well outnumber the dimension $K$, roughly the number of shared features.

We compare this upper bound to a lower bound for a large class of algorithms which 
learn the tasks independently. 


Definition 6 An algorithm $f : S_d^n \times \{-1,1\}^n \to S_d$ is called orthogonally equivariant if

$$f(U\mathbf{x}, \mathbf{y}) = U f(\mathbf{x}, \mathbf{y}) \quad \text{for every orthogonal matrix } U \in \mathbb{R}^{d \times d}. \qquad (10)$$


For data transformed by an orthogonal transformation an orthogonally equivariant
algorithm produces a correspondingly transformed hypothesis. Any algorithm which does not
depend on a specific coordinate system is orthogonally equivariant. This class of algorithms
includes all kernel methods, but it excludes the Lasso (L1-norm regularization). If the
known properties of the problem possess a rotation symmetry, only equivariant algorithms
make sense.
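
Orthogonal equivariance is easy to verify numerically for a concrete learner. The sketch below uses a toy algorithm, the normalised average of $y_i x_i$, chosen here only because it is coordinate-free; it is an assumption for illustration and not one of the algorithms evaluated in this paper.

```python
import numpy as np

def mean_classifier(X, y):
    """Toy learner: normalised average of y_i * x_i (rows of X are the inputs)."""
    w = (y[:, None] * X).sum(axis=0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
d, n = 30, 10

# n inputs on the unit sphere, labelled by the half-space of the first axis.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
u = np.eye(d)[0]
y = np.sign(X @ u)

# A random orthogonal matrix obtained from a QR factorisation.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

lhs = mean_classifier(X @ U.T, y)   # f(Ux, y): every input is rotated by U
rhs = U @ mean_classifier(X, y)     # U f(x, y)
print("equivariant:", np.allclose(lhs, rhs))
```

A learner that penalises the L1 norm of the weights would in general fail this check, since the L1 norm is not preserved by rotations.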

Below we denote by $\mathrm{err}(u, v)$ the misclassification error between the half-spaces
associated with unit vectors $u$ and $v$, that is $\mathrm{err}(u, v) = \Pr_{x\sim\sigma}\{\langle u, x\rangle \langle v, x\rangle < 0\}$. The following
lower error bound is given in (Maurer and Pontil, 2008).


Theorem 7 Let $n < d$ and suppose that $f : S_d^n \times \{-1,1\}^n \to S_d$ is an orthogonally
equivariant algorithm. Then for $\delta > 0$ with probability at least $1 - \delta$ in the draw of $\mathbf{X} \sim \sigma^n$
we have for every $u \in S_d$ that

$$\mathrm{err}\big(u, f(\mathbf{X}, u(\mathbf{X}))\big) \ge \frac{1}{\pi}\sqrt{\frac{d-n}{d}} - \sqrt{\frac{\ln(1/\delta)}{d}},$$

where $u(\mathbf{X}) = (\mathrm{sgn}\langle u, X_1\rangle, \dots, \mathrm{sgn}\langle u, X_n\rangle)$.


If we use a union bound to subtract the upper bound (9) from this lower bound we
obtain high probability guarantees for the advantage of representation learning over other
algorithms.

In the following section we plot the phase diagram derived here, namely the difference
between the uniform lower bound and our upper bound, and compare it with empirical
results (see Figure 4).

3.2 Numerical Experiments

The purpose of the experiments is to compare MTL and LTL to independent task learning
(ITL) in the simple setting of linear feature learning (or subspace learning).¹ We wish to
study the regime in which MTL/LTL is beneficial over ITL as a function of the
number of tasks $T$ and the sample size per task $n$.

We consider noiseless linear binary classification tasks, namely half-space learning. We
generated the data in the following way. The ground truth weight vectors $u_1, \dots, u_T$ are
obtained by the equation $u_t = D c_t$, where $c_t \in \mathbb{R}^K$ is sampled from the uniform distribution
on the unit sphere in $\mathbb{R}^K$, and the dictionary $D \in \mathbb{R}^{d \times K}$ is created by first sampling a
$d$-dimensional orthonormal matrix from the Haar measure, and then selecting the first $K$
columns (atoms). We create all input marginals by sampling from the uniform distribution
on the sphere of radius $\sqrt{d}$ in $\mathbb{R}^d$. For each task we sample $n$ instances to build the training
set, and 1000 instances for the test set.
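
The data-generating procedure just described can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' released code: the helper names are hypothetical, the Haar-distributed matrix is obtained from a QR factorisation of a Gaussian matrix with a sign correction, and points on a sphere are obtained by normalising Gaussian vectors.

```python
import numpy as np

def haar_orthonormal(d, rng):
    """Sample a d x d orthogonal matrix from the Haar measure via QR."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))      # column sign fix for an exact Haar law

def sample_sphere(m, dim, radius, rng):
    """m points drawn uniformly from the sphere of the given radius in R^dim."""
    G = rng.standard_normal((m, dim))
    return radius * G / np.linalg.norm(G, axis=1, keepdims=True)

def make_tasks(T, n, d, K, rng, n_test=1000):
    """Return the dictionary D, the task vectors u_t and per-task train/test data."""
    D = haar_orthonormal(d, rng)[:, :K]        # d x K dictionary (first K atoms)
    C = sample_sphere(T, K, 1.0, rng)          # codes c_t on the unit sphere of R^K
    U = C @ D.T                                # row t is u_t = D c_t
    tasks = []
    for u in U:
        X_train = sample_sphere(n, d, np.sqrt(d), rng)
        X_test = sample_sphere(n_test, d, np.sqrt(d), rng)
        tasks.append((X_train, np.sign(X_train @ u), X_test, np.sign(X_test @ u)))
    return D, U, tasks

rng = np.random.default_rng(0)
D, U, tasks = make_tasks(T=5, n=10, d=50, K=2, rng=rng)
print(D.shape, U.shape, tasks[0][0].shape, tasks[0][2].shape)
```

Because $D$ has orthonormal columns and each $c_t$ is a unit vector, every $u_t$ is itself a unit vector lying in the $K$-dimensional column space of $D$.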

We train the methods with the hinge loss function $h(z) := \max\{0, 1 - z/c\}$, where $c$
is the margin. We choose $c = 2/\epsilon$, so that the true error relative to the best hypothesis
is of order $\epsilon$. We fixed the value of $\epsilon$ to be $(K/n)^{1/4}$. For ITL we optimize that loss
function constraining the $\ell_2$-norm of the weights; for MTL and LTL we constrain $D$ to have
a Frobenius norm less than or equal to 1, and each $c_t$ is constrained to have an $\ell_2$-norm less
than or equal to 1. During testing we use the 0-1 loss. For example the task-average error is
evaluated as



( 11 ) 


where ut are the weight vectors learned by the assessed method. 
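
A task-averaged 0-1 test error of this kind can be sketched as follows. The sketch assumes that each task comes with a test matrix and labels in $\{-1, +1\}$ and that the assessed method returns one weight vector per task; the function name, the argument layout and the plain average over tasks and test points are assumptions made for illustration.

```python
import numpy as np

def task_average_error(U_hat, test_sets):
    """Mean over tasks of the fraction of misclassified test points.

    U_hat     : array of shape (T, d), one learned weight vector per task
    test_sets : list of T pairs (X_test, y_test) with labels in {-1, +1}
    """
    per_task = [
        np.mean(np.sign(X_test @ u_hat) != y_test)
        for u_hat, (X_test, y_test) in zip(U_hat, test_sets)
    ]
    return float(np.mean(per_task))

rng = np.random.default_rng(0)
T, d, n_test = 4, 10, 1000
U_true = rng.standard_normal((T, d))
test_sets = []
for u in U_true:
    X_test = rng.standard_normal((n_test, d))
    test_sets.append((X_test, np.sign(X_test @ u)))

err_oracle = task_average_error(U_true, test_sets)     # true vectors: no mistakes
err_flipped = task_average_error(-U_true, test_sets)   # negated vectors: all wrong
print(err_oracle, err_flipped)
```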


3.3 MTL Experiment 

We first discuss the MTL experiment. We let $d = 50$, and vary $T \in \{5, 10, \dots, 150\}$,
$n \in \{5, 10, \dots, 150\}$, considering the cases $K = 2$ and $K = 5$. In Figure 1 we report the
difference between the classification error of the two methods. These results are obtained by
repeating the experiment 10 times, reporting the average difference. In each trial a different
set of input points and underlying weight vectors is generated for each task. In the MTL
case the training error was always below 0.1 and on average it was smaller than 0.04. This
suggests that despite the problem being non-convex, the gradient optimization algorithm
finds a good suboptimal solution.

We have made further experiments to assess the influence of other data settings on the
difference between ITL and MTL. In the first of those experiments we have explored the
cases in which the dictionary size is overestimated and underestimated. The results are
shown in Figure 2. In the left plot the dictionary size is overestimated: the
ground truth number of atoms is 2, and the number of atoms used in the MTL method is
5. We see a pattern similar to the one in Figure 1, although the differences
between ITL and MTL are not as large. The performance is slightly hampered, as expected
due to an overestimation of the number of atoms. On the other hand in Figure 2 (right) we

1. The code used for the experiments presented in this section is available at
http://romera-paredes.com/multitask-representation

Figure 1: Difference of test classification error, computed according to eq. (11), between
ITL and MTL. The vertical axis represents the number of training tasks, and
the horizontal axis the number of training instances per task. In the left column
K = 2, and in the right column K = 5.



Figure 2: Difference of test classification error, computed according to eq. (11), between
ITL and MTL, when the number of atoms of the ground truth dictionary does
not match the number of atoms of the MTL model. The plot on the left shows the
experiment in which the ground truth number of atoms is 2, whereas the number
of atoms used in the MTL approach is 5. The plot on the right shows the opposite
scenario: 5 atoms as ground truth, and 2 atoms in the MTL model. The vertical
axis represents the number of training tasks, and the horizontal axis the number
of training instances per task.


show the results when the number of atoms in the ground truth dictionary is 5, whereas the
number of atoms used in the MTL approach is 2. In this case we see that the performance
is severely affected by the underestimation of the size of the dictionary, yet we observe that
MTL performs better than ITL in the same regime as in the previous experiments.

Figure 3: Difference of test classification error, computed according to eq. (11), between
ITL and MTL, when adding Gaussian noise to the ground truth labels. The
vertical axis represents the number of training tasks, and the horizontal axis the
number of training instances per task.


In the second of these experiments we study how the results are affected when the data
are noisy. To do so, we have generated the data so that the ground truth label for instance
$x_i$ for task $t$ is given by $\mathrm{sign}(\langle u_t, x_i\rangle + \epsilon_{ti})$, where $\epsilon_{ti} \sim \mathcal{N}(0, 1)$. The dictionary size, for both
the ground truth and the MTL approach, is $K = 2$. The results are shown in Figure 3, and
we can see a behaviour similar to the one in Figure 1, with somewhat smaller differences
between ITL and MTL.

3.4 LTL Experiment 

In this experiment we test how the dictionary learned at the training stage helps learning
new tasks, and we assess how similar the resulting figure is to the phase
diagram derived in the previous section.

The data is generated according to the settings given in the MTL experiment.
Furthermore, 50 new tasks are sampled following the same scheme previously described for the
purpose of computing the LTL test error. We present the results in Figure 4 (Top). As in
the previous experiment, we report the average difference between the test error of ITL
and LTL after 10 trials.

In Figure 4 (Bottom) we present the theoretical phase diagram, which was generated
using 1 < T < 10^^, 1 < n < 10^, d = 10®, δ = 0.0001. We also plot as a dark line the
points at which there is no difference in performance between ITL and LTL.

The reader may object to the much larger parameter values used to generate the
plots of theoretical differences, in comparison to the experimental settings. These large
parameters are partly a consequence of an accumulation of somewhat loose estimates in the
derivation of both the upper and lower bounds. Another reason is that in applying our results to
a noiseless, finite-dimensional problem (for clarity) we have sacrificed two of their strong
points: independence of the input dimension and their agnostic nature. Apart from the large





Figure 4: The vertical axis represents the number of training tasks, and the horizontal 
axis the number of training instances per task. Plots in the top row show the 
difference of test classification error, computed on 50 new tasks, between ITL 
and LTL. Plots in the bottom row show the region where the upper bound for 
LTL is smaller than the lower bound for any equivariant algorithm for ITL (see 
the discussion in Section 3.1, in particular Equation (9)) using 1 < T < 10^^,
1 < n < 10^ , d = 10^, and δ = 0.0001. In the left column K = 2, and in the
right column K = 5. 


parameter values the theoretical prediction shown in Figure 4 (Bottom) is in very good
agreement with the experimental results in Figure 4 (Top).

We have also performed experiments in order to evaluate the influence of noise and
under/overestimation of the dictionary size on the difference between ITL and LTL. We
obtained results similar to the ones reported for MTL in Figures 2 and 3.

Finally, we have compared the learned dictionary, $\hat D$, with the ground truth, $D$, in the
same regime of parameters used for the previous experiments. Note that a dictionary could
be correct up to permutations and changes of sign of its atoms. To overcome this issue we
use the similarity measure

$$s(D, \hat D) = \frac{1}{K}\, \big\| D^\top \hat D \big\|_{\mathrm{tr}}, \qquad (12)$$

Figure 5: Similarity between the learned dictionary $\hat D$ and the ground truth dictionary $D$,
according to the similarity measure $s(D, \hat D)$ in Equation (12). The vertical axis
represents the number of training tasks, and the horizontal axis the number of
training instances per task. Left plot: K = 2. Right plot: K = 5.

where $\| \cdot \|_{\mathrm{tr}}$ is the sum of singular values of a matrix. Note that $s(D, \hat D) = 1$ if $\hat D$ and $D$
are the same matrix up to permutation of columns and changes of sign, as requested. The
results are found in Figure 5.
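
The similarity measure in Equation (12) is straightforward to compute with a nuclear norm. The sketch below is illustrative: the function name is hypothetical, and the test dictionaries are built from a random orthogonal matrix rather than from the output of any learning method.

```python
import numpy as np

def dictionary_similarity(D, D_hat):
    """s(D, D_hat) = (1/K) * (sum of singular values of D^T D_hat)."""
    K = D.shape[1]
    return np.linalg.norm(D.T @ D_hat, ord="nuc") / K

rng = np.random.default_rng(0)
d, K = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal columns
D = Q[:, :K]

# Same atoms, shuffled and with random sign flips.
perm = rng.permutation(K)
signs = rng.choice([-1.0, 1.0], size=K)
D_shuffled = D[:, perm] * signs

# Atoms spanning a subspace orthogonal to that of D.
D_orthogonal = Q[:, K:2 * K]

s_same = dictionary_similarity(D, D_shuffled)
s_orth = dictionary_similarity(D, D_orthogonal)
print(f"shuffled/sign-flipped atoms: {s_same:.3f}, orthogonal atoms: {s_orth:.3f}")
```

For dictionaries with orthonormal columns the measure lies between 0 and 1, and it equals 1 whenever the two dictionaries span the same subspace, which covers permutations and sign changes as a special case.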

Figure 5 indicates that the learned dictionary is close to the true dictionary even for small
sample sizes, provided $T$ is large. This supports the results in Figure 1 and the top plots in
Figure 4, where MTL and LTL, respectively, are found to be superior to ITL in this regime.

4. Proofs of the Main Theorems

In this section we prove our principal results, Theorem 1 and Theorem 2. In preparation
for the proofs we will first present some important auxiliary results.

4.1 Tools 

We denote by $\gamma$ a generic vector of independent standard normal variables, whose dimension
will be clear from context. A central role in this paper is played by the Gaussian average
$G(Y)$ of a set $Y \subseteq \mathbb{R}^n$, which is defined as

$$G(Y) = \mathbb{E} \sup_{y \in Y} \langle \gamma, y \rangle = \mathbb{E} \sup_{y \in Y} \sum_{i=1}^{n} \gamma_i y_i.$$

The reader who is concerned about the measurability of the random variable on the right
hand side should replace $Y$ by a countable dense subset of $Y$, with similar adjustments
wherever the Gaussian averages occur.

Rademacher averages, where the $\gamma_i$ are replaced by uniform $\{-1, 1\}$-distributed
variables, are somewhat more popular in the literature. We use Gaussian averages instead,
because in most cases they are just as easy to bound and possess special properties
(Theorem 10 and Theorem 12 below) which we need in our analysis.
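
For a finite set $Y$ the Gaussian average can be estimated by simple Monte Carlo. The sketch below is an illustration under that finiteness assumption; the function name and the number of draws are arbitrary choices. For the two-point set $Y = \{e_1, -e_1\}$ the exact value is $\mathbb{E}|\gamma_1| = \sqrt{2/\pi}$, which serves as a reference.

```python
import numpy as np

def gaussian_average(Y, n_draws=20000, seed=0):
    """Monte-Carlo estimate of G(Y) = E sup_{y in Y} <gamma, y>.

    Y is a finite set given as an array of shape (num_points, n).
    """
    rng = np.random.default_rng(seed)
    gamma = rng.standard_normal((n_draws, Y.shape[1]))
    # For each Gaussian draw take the supremum of <gamma, y> over the rows of Y.
    return (gamma @ Y.T).max(axis=1).mean()

# Two-point set {e_1, -e_1} in R^5.
Y = np.zeros((2, 5))
Y[0, 0], Y[1, 0] = 1.0, -1.0

g_est = gaussian_average(Y)
g_exact = np.sqrt(2.0 / np.pi)
print(f"estimate {g_est:.4f}, exact {g_exact:.4f}")
```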




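The first result is a standard tool to prove uniform bounds on the estimation error in
terms of Gaussian averages (Bartlett and Mendelson, 2002).

Theorem 8 Let $\mathcal{F}$ be a real-valued function class on a space $\mathcal{X}$ and let $\mathbf{X} = (X_1, \dots, X_n)$
be a vector of independent random variables and $\mathbf{X}'$ iid to $\mathbf{X}$. Then

(i)
$$\mathbb{E}_{\mathbf{X}} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( \mathbb{E}_{\mathbf{X}'}[f(X_i')] - f(X_i) \big) \le \frac{\sqrt{2\pi}\, \mathbb{E}_{\mathbf{X}}\, G(\mathcal{F}(\mathbf{X}))}{n};$$

(ii) if the members of $\mathcal{F}$ have values in $[0,1]$ then with probability greater than $1 - \delta$ in
$\mathbf{X}$ for all $f \in \mathcal{F}$

$$\frac{1}{n} \sum_{i=1}^{n} \big( \mathbb{E}_{\mathbf{X}'}[f(X_i')] - f(X_i) \big) \le \frac{\sqrt{2\pi}\, G(\mathcal{F}(\mathbf{X}))}{n} + \sqrt{\frac{9 \ln(2/\delta)}{2n}}.$$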

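The following theorem is a vector-valued version of the above, which is useful for bounds on
the task-averaged estimation error (Ando and Zhang, 2005; Maurer, 2006b).

Theorem 9 Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to [0,1]^T$, and let $\mu_1, \dots, \mu_T$ be probability
measures on $\mathcal{X}$ with $\mathbf{X} = (\mathbf{X}_1, \dots, \mathbf{X}_T) \sim \prod_{t=1}^{T} (\mu_t)^n$ where $\mathbf{X}_t = (X_{t1}, \dots, X_{tn})$. Then with
probability greater than $1 - \delta$ in $\mathbf{X}$ for all $f \in \mathcal{F}$

$$\frac{1}{T} \sum_{t} \Big( \mathbb{E}_{X \sim \mu_t}[f_t(X)] - \frac{1}{n} \sum_{i} f_t(X_{ti}) \Big) \le \frac{\sqrt{2\pi}\, G(Y)}{nT} + \sqrt{\frac{9 \ln(2/\delta)}{2nT}},$$

where $Y \subseteq \mathbb{R}^{Tn}$ is the random set defined by $Y = \{ (f_t(X_{ti})) : f \in \mathcal{F} \}$.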


The previous two theorems replace the problem of proving uniform bounds by the problem
of bounding Gaussian averages. One key result in the latter direction is known as
Slepian's Lemma (Slepian, 1962; Ledoux and Talagrand, 1991).


Theorem 10 Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed by a common
set $S$, such that

$$\mathbb{E}(\Omega_{s_1} - \Omega_{s_2})^2 \le \mathbb{E}(\Xi_{s_1} - \Xi_{s_2})^2 \quad \text{for all } s_1, s_2 \in S.$$

Then

$$\mathbb{E} \sup_{s \in S} \Omega_s \le \mathbb{E} \sup_{s \in S} \Xi_s.$$


The following corollary is the key to our bound for LTL.


Corollary 11 Let $Y \subseteq \mathbb{R}^n$ and let $\phi : Y \to \mathbb{R}^m$ be (Euclidean) Lipschitz with Lipschitz
constant $L$. Then

$$G(\phi(Y)) \le L\, G(Y).$$

Proof Define two Gaussian processes indexed by $Y$ as

$$\Omega_y = \sum_{k=1}^{m} \gamma_k\, \phi(y)_k \quad \text{and} \quad \Xi_y = L \sum_{i=1}^{n} \gamma_i'\, y_i$$

with independent $\gamma_k$ and $\gamma_i'$. Then for any $y, y' \in Y$

$$\mathbb{E}(\Omega_y - \Omega_{y'})^2 = \|\phi(y) - \phi(y')\|^2 \le L^2 \|y - y'\|^2 = \mathbb{E}(\Xi_y - \Xi_{y'})^2,$$

so that, by Slepian's Lemma,

$$G(\phi(Y)) = \mathbb{E} \sup_{y \in Y} \Omega_y \le \mathbb{E} \sup_{y \in Y} \Xi_y = L\, G(Y).$$


In many applications this is applied when $n = m$ and $\phi$ is defined by $\phi(y_1, \dots, y_n) =
(\phi_1(y_1), \dots, \phi_n(y_n))$ where the real functions $\phi_1, \dots, \phi_n$ have Lipschitz constant $L$.
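
The contraction property can be observed numerically on a finite set. The sketch below is an illustration only: it uses a Monte-Carlo estimate of the Gaussian average, a randomly generated finite set $Y$, and the coordinate-wise map $y_i \mapsto L \tanh(y_i)$, all of which are choices made here and not taken from the paper. Since the estimates are random, the comparison illustrates the inequality rather than proving it.

```python
import numpy as np

def gaussian_average(Y, n_draws=20000, seed=0):
    """Monte-Carlo estimate of G(Y) = E sup_{y in Y} <gamma, y> for finite Y."""
    rng = np.random.default_rng(seed)
    gamma = rng.standard_normal((n_draws, Y.shape[1]))
    return (gamma @ Y.T).max(axis=1).mean()

L = 0.5

def phi(Y):
    """Coordinate-wise map y_i -> L * tanh(y_i); each coordinate is L-Lipschitz."""
    return L * np.tanh(Y)

rng = np.random.default_rng(1)
Y = 3.0 * rng.standard_normal((50, 5))   # a finite set of 50 points in R^5

g_Y = gaussian_average(Y)
g_phi = gaussian_average(phi(Y))
print(f"G(phi(Y)) ~ {g_phi:.3f} <= L * G(Y) ~ {L * g_Y:.3f}")
```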

At one point we will need a generalization of the above corollary, which allows us to select
$\phi$ from an entire class of Lipschitz functions. We will use the following result, taken from
the literature; it will play an important role in the proof of Theorem 13 below.


Theorem 12 Let $Y \subseteq \mathbb{R}^n$ have (Euclidean) diameter $D(Y)$ and let $\mathcal{F}$ be a class of
functions $f : Y \to \mathbb{R}^m$, all of which have Lipschitz constant at most $L(\mathcal{F})$. Then for any
$y_0 \in Y$

$$G(\mathcal{F}(Y)) \le c_1 L(\mathcal{F})\, G(Y) + c_2 D(Y)\, Q(\mathcal{F}) + G(\mathcal{F}(y_0)),$$

where $c_1$ and $c_2$ are universal constants and

$$Q(\mathcal{F}) = \sup_{y, y' \in Y,\; y \ne y'} \mathbb{E} \sup_{f \in \mathcal{F}} \frac{\langle \gamma, f(y) - f(y') \rangle}{\|y - y'\|}.$$

Note that the result allows us to minimize the right hand side in $y_0$. Analogs of Theorem
10 and Theorem 12 are not available for Rademacher averages. This is the reason why we
use the slightly more exotic Gaussian averages.


4.2 Proof of the Excess Risk Bound for the Average Risk 

We first establish the following uniform bound. It is of some interest in its own right, in
particular since the problem (1) is often non-convex, so that the excess risk bound may not
be meaningful in practice. Recall the definition of $Q$ given in Equation (4).

Theorem 13 Let $\mu_1, \dots, \mu_T$ be probability measures on $\mathcal{Z}$ and let $Z_{t1}, \dots, Z_{tn}$ be i.i.d.
from $\mu_t$, for $t = 1, \dots, T$. Let $\delta \in (0,1)$. With probability at least $1 - \delta$ in the draw of a
multisample $\mathbf{Z}$, it holds for every $h \in \mathcal{H}$ and every $f_1, \dots, f_T \in \mathcal{F}$ that

$$\mathcal{E}_{\mathrm{avg}}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \le c_1 \frac{L\, G(\mathcal{H}(\mathbf{X}))}{nT} + c_2 \frac{Q \sup_{h \in \mathcal{H}} \|h(\mathbf{X})\|}{n\sqrt{T}} + \sqrt{\frac{9 \ln(2/\delta)}{2nT}},$$

where $c_1$ and $c_2$ are universal constants.




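Proof By Theorem 9, with probability at least $1 - \delta$ in $\mathbf{Z}$, for all $h \in \mathcal{H}$ and all $f_1, \dots, f_T \in \mathcal{F}$,
we have that

$$\mathcal{E}_{\mathrm{avg}}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \le \frac{\sqrt{2\pi}\, G(S)}{nT} + \sqrt{\frac{9 \ln(2/\delta)}{2nT}}, \qquad (13)$$

where $S = \{ (\ell(f_t(h(X_{ti})), Y_{ti})) : f \in \mathcal{F}^T \text{ and } h \in \mathcal{H} \} \subseteq \mathbb{R}^{Tn}$. By the Lipschitz property of
the loss function $\ell$ and the contraction lemma, Corollary 11 (recall the remark which follows
its proof), we have $G(S) \le G(S')$, where $S' = \{ (f_t(h(X_{ti}))) : f \in \mathcal{F}^T \text{ and } h \in \mathcal{H} \} \subseteq \mathbb{R}^{Tn}$.
Recall that $\mathcal{H}(\mathbf{X}) \subseteq \mathbb{R}^{KTn}$ is defined by

$$\mathcal{H}(\mathbf{X}) = \{ (h_k(X_{ti})) : h \in \mathcal{H} \},$$

and define a class of functions $\mathcal{F}' : \mathbb{R}^{KTn} \to \mathbb{R}^{Tn}$ by

$$\mathcal{F}' = \{ y \in \mathbb{R}^{KTn} \mapsto (f_t(y_{ti})) : (f_1, \dots, f_T) \in \mathcal{F}^T \}.$$

Then $S' = \mathcal{F}'(\mathcal{H}(\mathbf{X}))$, and by Theorem 12 for universal constants $c_1'$ and $c_2'$

$$G(S') \le c_1' L(\mathcal{F}')\, G(\mathcal{H}(\mathbf{X})) + c_2' D(\mathcal{H}(\mathbf{X}))\, Q(\mathcal{F}') + \min_{y \in \mathcal{H}(\mathbf{X})} G(\mathcal{F}'(y)). \qquad (14)$$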


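We now proceed by bounding the individual terms in the right hand side above. Let
$y, y' \in \mathbb{R}^{KTn}$, where $y = (y_{ti})$ with $y_{ti} \in \mathbb{R}^K$ and $y' = (y'_{ti})$ with $y'_{ti} \in \mathbb{R}^K$. Then for
$f = (f_1, \dots, f_T) \in \mathcal{F}^T$

$$\|f(y) - f(y')\|^2 = \sum_{ti} \big( f_t(y_{ti}) - f_t(y'_{ti}) \big)^2 \le L^2 \sum_{ti} \|y_{ti} - y'_{ti}\|^2 = L^2 \|y - y'\|^2,$$

so that $L(\mathcal{F}') \le L$. Also, writing $y_t = (y_{t1}, \dots, y_{tn})$,

$$\mathbb{E} \sup_{g \in \mathcal{F}'} \langle \gamma, g(y) - g(y') \rangle
= \mathbb{E} \sup_{f \in \mathcal{F}^T} \sum_{ti} \gamma_{ti} \big( f_t(y_{ti}) - f_t(y'_{ti}) \big)
= \sum_{t} \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i} \gamma_{ti} \big( f(y_{ti}) - f(y'_{ti}) \big)
\le Q \sum_{t} \|y_t - y'_t\|
\le \sqrt{T}\, Q\, \|y - y'\|,$$

whence $Q(\mathcal{F}') \le \sqrt{T}\, Q$. Finally we take $y_0 = 0$ and the last term in (14) vanishes since
$f(0) = 0$ for all $f \in \mathcal{F}$. Substituting in (14) and using $G(S) \le G(S')$ we arrive at

$$G(S) \le c_1' L\, G(\mathcal{H}(\mathbf{X})) + c_2' \sqrt{T}\, D(\mathcal{H}(\mathbf{X}))\, Q.$$

Bounding $D(\mathcal{H}(\mathbf{X})) \le 2 \sup_h \|h(\mathbf{X})\|$ and substituting in (13) gives the result. ■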


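Proof of Theorem 1 Let $h^*$ and $f_1^*, \dots, f_T^*$ be the minimizers in the definition of $\mathcal{E}^*_{\mathrm{avg}}$.
Then

$$\mathcal{E}_{\mathrm{avg}}(\hat h, \hat f_1, \dots, \hat f_T) - \mathcal{E}^*_{\mathrm{avg}}
= \Big( \mathcal{E}_{\mathrm{avg}}(\hat h, \hat f_1, \dots, \hat f_T) - \frac{1}{nT} \sum_{ti} \ell(\hat f_t(\hat h(X_{ti})), Y_{ti}) \Big)$$
$$+ \Big( \frac{1}{nT} \sum_{ti} \ell(\hat f_t(\hat h(X_{ti})), Y_{ti}) - \frac{1}{nT} \sum_{ti} \ell(f_t^*(h^*(X_{ti})), Y_{ti}) \Big)$$
$$+ \Big( \frac{1}{nT} \sum_{ti} \ell(f_t^*(h^*(X_{ti})), Y_{ti}) - \frac{1}{T} \sum_{t} \mathbb{E}_{(X,Y) \sim \mu_t}\, \ell(f_t^*(h^*(X)), Y) \Big).$$

The last term involves only the $nT$ random variables $\ell(f_t^*(h^*(X_{ti})), Y_{ti})$ with values in
$[0,1]$. It can be bounded with probability $1 - \delta/2$ by $\sqrt{\ln(2/\delta)/(2Tn)}$ using Hoeffding's
inequality. The middle term is non-positive by definition of $\hat h, \hat f_1, \dots, \hat f_T$ being the
corresponding minimizers. There remains the first term which we bound by

$$\sup_{h \in \mathcal{H},\, f_1, \dots, f_T \in \mathcal{F}} \Big( \mathcal{E}_{\mathrm{avg}}(h, f_1, \dots, f_T) - \frac{1}{nT} \sum_{ti} \ell(f_t(h(X_{ti})), Y_{ti}) \Big)$$

and appeal to Theorem 13 to bound the supremum. A union bound then completes the
proof. ■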


4.3 Proof of the Excess Risk Bound for Learning-to-learn 

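Recall the definition of the algorithm parametrized by $h \in \mathcal{H}$,

$$a(h)_Z = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i} \ell(f(h(X_i)), Y_i) \quad \text{for } Z \in \mathcal{Z}^n,$$

and the associated minimum $m(h)_Z$. Also recall that

$$\mathcal{E}_\eta(h) = \mathbb{E}_{\mu \sim \eta}\, \mathbb{E}_{Z \sim \mu^n}\, \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y)$$

and the two measures $\mu_\eta$ and $\rho_\eta$ induced by the environment $\eta$ and defined by

$$\mu_\eta(A) = \mathbb{E}_{\mu \sim \eta}[\mu(A)] \text{ for } A \subseteq \mathcal{Z} \quad \text{and} \quad \rho_\eta(A) = \mathbb{E}_{\mu \sim \eta}[\mu^n(A)] \text{ for } A \subseteq \mathcal{Z}^n.$$

Also recall the definition of $Q'$ given in Equation (5). Again we begin with a uniform bound.

Theorem 14 Let $\delta \in (0,1)$. (i) With probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$ it holds for every
$h \in \mathcal{H}$ that

$$\mathcal{E}_\eta(h) - \frac{1}{T} \sum_{t} m(h)_{Z_t} \le \frac{\sqrt{2\pi}\, L\, G(\mathcal{H}(\mathbf{X}))}{T\sqrt{n}} + \sqrt{2\pi}\, Q' \sup_{h \in \mathcal{H}} \sqrt{\frac{\mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2}{n}} + \sqrt{\frac{9 \ln(2/\delta)}{2T}}.$$

(ii) With probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$ it holds for every $h \in \mathcal{H}$ that

$$\mathcal{E}_\eta(h) - \frac{1}{T} \sum_{t} m(h)_{Z_t} \le \frac{\sqrt{2\pi}\, L\, G(\mathcal{H}(\mathbf{X}))}{T\sqrt{n}} + \frac{\sqrt{2\pi}\, Q' \sum_{t} \sup_{h \in \mathcal{H}} \|h(\mathbf{X}_t)\|}{nT} + \sqrt{\frac{16 \ln(4/\delta)}{T}}.$$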


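Proof The key to the proof is the decomposition bound

$$\sup_{h \in \mathcal{H}} \Big( \mathcal{E}_\eta(h) - \frac{1}{T} \sum_{t} m(h)_{Z_t} \Big)
\le \sup_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \eta} \mathbb{E}_{Z \sim \mu^n} \big[ \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y) - m(h)_Z \big]
+ \sup_{h \in \mathcal{H}} \Big( \mathbb{E}_{Z \sim \rho_\eta}[m(h)_Z] - \frac{1}{T} \sum_{t=1}^{T} m(h)_{Z_t} \Big). \qquad (15)$$

In turn we will bound both terms on the right hand side above. A bound on the second
term means that we can predict the empirical risk on the data of a future task uniformly
in $h$. A bound on the first term means that we can predict the true risk from the empirical
risk on the future task.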

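We first bound the second term in the right hand side of (15), and use Theorem 8-(ii)
on the class of functions

$$\{ z \in \mathcal{Z}^n \mapsto m(h)_z : h \in \mathcal{H} \}$$

to get with probability at least $1 - \delta$ in $\mathbf{Z} \sim \rho_\eta^T$ that

$$\sup_{h \in \mathcal{H}} \Big( \mathbb{E}_{Z \sim \rho_\eta}[m(h)_Z] - \frac{1}{T} \sum_{t=1}^{T} m(h)_{Z_t} \Big) \le \frac{\sqrt{2\pi}}{T}\, G(S) + \sqrt{\frac{9 \ln(2/\delta)}{2T}},$$

where $S$ is the subset of $\mathbb{R}^T$ defined by

$$S = \{ (m(h)_{Z_1}, \dots, m(h)_{Z_T}) : h \in \mathcal{H} \}.$$

We will bound the Gaussian average of $S$ using Slepian's inequality (Theorem 10). Define
two Gaussian processes indexed by $\mathcal{H}$ as

$$\Omega_h = \sum_{t} \gamma_t\, m(h)_{Z_t} \quad \text{and} \quad \Xi_h = \frac{L}{\sqrt{n}} \sum_{kti} \gamma_{kti}\, h_k(X_{ti}).$$

Now for any $z \in \mathcal{Z}^n$ and representations $h, h' \in \mathcal{H}$

$$\big( m(h)_z - m(h')_z \big)^2
= \Big( \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i} \ell(f(h(x_i)), y_i) - \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i} \ell(f(h'(x_i)), y_i) \Big)^2$$
$$\le \Big( \sup_{f \in \mathcal{F}} \frac{1}{n} \Big| \sum_{i} \ell(f(h(x_i)), y_i) - \ell(f(h'(x_i)), y_i) \Big| \Big)^2
\le \frac{1}{n} \sup_{f \in \mathcal{F}} \sum_{i} \big( \ell(f(h(x_i)), y_i) - \ell(f(h'(x_i)), y_i) \big)^2
\le \frac{L^2}{n} \sum_{ki} \big( h_k(x_i) - h'_k(x_i) \big)^2,$$

where in the last step we used the Lipschitz properties of the loss function $\ell$ and of the
members in the class $\mathcal{F}$. It follows that

$$\mathbb{E}(\Omega_h - \Omega_{h'})^2 = \sum_{t} \big( m(h)_{Z_t} - m(h')_{Z_t} \big)^2 \le \frac{L^2}{n} \sum_{kti} \big( h_k(X_{ti}) - h'_k(X_{ti}) \big)^2 = \mathbb{E}(\Xi_h - \Xi_{h'})^2,$$

so by Theorem 10

$$G(S) = \mathbb{E} \sup_{h} \Omega_h \le \mathbb{E} \sup_{h} \Xi_h = \frac{L}{\sqrt{n}}\, G(\mathcal{H}(\mathbf{X})).$$

The second term in the right hand side of (15) is thus bounded with probability $1 - \delta$ by

$$\frac{\sqrt{2\pi}\, L\, G(\mathcal{H}(\mathbf{X}))}{T\sqrt{n}} + \sqrt{\frac{9 \ln(2/\delta)}{2T}}. \qquad (16)$$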

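We now bound the first term on the right hand side of (15) by

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \eta} \mathbb{E}_{Z \sim \mu^n} \big[ \mathbb{E}_{(X,Y) \sim \mu}\, \ell(a(h)_Z(X), Y) - m(h)_Z \big]
\le \sup_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \eta} \mathbb{E}_{Z \sim \mu^n} \sup_{f \in \mathcal{F}} \Big[ \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h(X)), Y) - \frac{1}{n} \sum_{i} \ell(f(h(X_i)), Y_i) \Big].$$

For $Z = (\mathbf{X}, \mathbf{Y}) \in \mathcal{Z}^n$ and $h \in \mathcal{H}$ denote with $\ell(\mathcal{F} \circ h(\mathbf{X}), \mathbf{Y})$ the subset of $\mathbb{R}^n$ defined by

$$\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}) = \{ (\ell(f(h(X_i)), Y_i)) : f \in \mathcal{F} \}.$$

Using Theorem 8-(i) and the contraction lemma, Corollary 11, we can bound the last
expression above by

$$\sqrt{2\pi} \sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \frac{G(\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}))}{n}
\le \sqrt{2\pi} \sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \frac{G(\mathcal{F}(h(\mathbf{X})))}{n}$$
$$\le \frac{\sqrt{2\pi}}{n} \sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \frac{G(\mathcal{F}(h(\mathbf{X})))}{\|h(\mathbf{X})\|}\, \|h(\mathbf{X})\|
\le \frac{\sqrt{2\pi}\, Q'}{n} \sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \|h(\mathbf{X})\|,$$

using Hoelder's inequality and the definition of $Q'$ in the last step. But, using Jensen's
inequality,

$$\mathbb{E}_{Z \sim \rho_\eta} \|h(\mathbf{X})\| \le \sqrt{\mathbb{E}_{Z \sim \rho_\eta} \sum_{i} \|h(X_i)\|^2} = \sqrt{n\, \mathbb{E}_{(X,Y) \sim \mu_\eta} \|h(X)\|^2},$$

since $Z \sim \mu^n$ is iid. Inserting this in the previous chain of inequalities and combining with
(16) gives the first part of the theorem.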

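To obtain the data dependent bound we use the fact that, with probability at least
$1 - \delta/4$,

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \frac{G(\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}))}{n} \le \mathbb{E}_{Z \sim \rho_\eta} \sup_{h \in \mathcal{H}} \frac{G(\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}))}{n} \qquad (17)$$

$$\le \frac{1}{T} \sum_{t} \sup_{h \in \mathcal{H}} \frac{G(\ell(\mathcal{F}(h(\mathbf{X}_t)), \mathbf{Y}_t))}{n} + \sqrt{\frac{\ln(4/\delta)}{2T}}. \qquad (18)$$

The last inequality follows from Hoeffding's inequality since for any $h \in \mathcal{H}$ and any sample
$Z \in \mathcal{Z}^n$

$$0 \le \frac{G(\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}))}{n} \le \frac{1}{\sqrt{n}} \sup_{f \in \mathcal{F}} \Big( \sum_{i} \ell(f(h(X_i)), Y_i)^2 \Big)^{1/2} \le 1,$$

where we have also used the fact that the loss function $\ell$ has range in $[0,1]$. Bounding

$$G(\ell(\mathcal{F} \circ h(\mathbf{X}_t), \mathbf{Y}_t)) \le Q' \|h(\mathbf{X}_t)\|$$

as above and combining (16) and (18) in (15) with a union bound gives the second inequality
of the theorem.

Remark: In the proof of the fully data-dependent part above the bound on

$$\sup_{h \in \mathcal{H}} \mathbb{E}_{Z \sim \rho_\eta} \frac{G(\ell(\mathcal{F}(h(\mathbf{X})), \mathbf{Y}))}{n}$$

is very crude. Instead we could have again invoked Theorem 8 to get a better bound with
a more complicated expression involving nested Gaussian averages. We have chosen the
simpler path for greater clarity.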


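Proof of Theorem 2 Recall that

$$\mathcal{E}^*_\eta = \min_{h \in \mathcal{H}} \mathbb{E}_{\mu \sim \eta} \Big[ \min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h(X)), Y) \Big].$$

We denote with $h^*$ the minimizer in $\mathcal{H}$ occurring in the definition of $\mathcal{E}^*_\eta$. We have the
following decomposition

$$\mathcal{E}_\eta(\hat h) - \mathcal{E}^*_\eta = \Big( \mathcal{E}_\eta(\hat h) - \frac{1}{T} \sum_{t} m(\hat h)_{Z_t} \Big) \qquad (19)$$
$$+ \Big( \frac{1}{T} \sum_{t} m(\hat h)_{Z_t} - \frac{1}{T} \sum_{t} m(h^*)_{Z_t} \Big) \qquad (20)$$
$$+ \Big( \frac{1}{T} \sum_{t} m(h^*)_{Z_t} - \mathbb{E}_{Z \sim \rho_\eta}[m(h^*)_Z] \Big) \qquad (21)$$
$$+ \mathbb{E}_{\mu \sim \eta} \Big[ \mathbb{E}_{Z \sim \mu^n}[m(h^*)_Z] - \min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h^*(X)), Y) \Big]. \qquad (22)$$

For a fixed distribution $\mu$ let $f^*_\mu$ be the minimizer in $\min_{f \in \mathcal{F}} \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f(h^*(X)), Y)$. By
definition of $m(h^*)_Z$ we have for every $\mu \sim \eta$ that

$$\mathbb{E}_{Z \sim \mu^n}[m(h^*)_Z] = \mathbb{E}_{Z \sim \mu^n} \min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(h^*(X_i)), Y_i)
\le \mathbb{E}_{Z \sim \mu^n} \frac{1}{n} \sum_{i=1}^{n} \ell(f^*_\mu(h^*(X_i)), Y_i)
= \mathbb{E}_{(X,Y) \sim \mu}\, \ell(f^*_\mu(h^*(X)), Y),$$

since $Z$ is iid. The term in (22) is therefore non-positive.

The term in (21) involves the deviation of the empirical and true averages of the $T$ iid
$[0,1]$-valued random variables $m(h^*)_{Z_t}$. With Hoeffding's inequality this can be bounded
with probability at least $1 - \delta/8$ by $\sqrt{\ln(8/\delta)/(2T)}$. The term (20) is non-positive by the
definition of $\hat h$.

There remains the term (19), which we bound by Theorem 14. The result now follows
by combining this bound with the bound on (21) in a union bound and some numerical
simplifications. ■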




5. Conclusion 

Several works have advocated sharing features among tasks as a means of learning
representations that capture properties invariant across tasks, arguing that this can be highly
beneficial. In this paper, we studied the statistical properties of a general MTRL method, presenting
bounds on its learning performance in both settings of MTL and LTL. Our work provides a
rigorous justification of the benefit offered by MTRL over learning the tasks independently.
To give the paper a clear focus we have illustrated this advantage in the case of linear
feature learning. Our results however apply to fairly general classes of representations $\mathcal{H}$
and specifications $\mathcal{F}$, and similar conclusions may be derived for other nonlinear MTRL
methods. We conclude by sketching specific cases which deserve a separate study:

• Deep networks. As we noted, our bounds directly apply to multilayer, deep
architectures obtained by iteratively composing linear transformations with nonlinear
activation functions, such as the rectified linear unit or the sigmoid function. The
representations learned by such methods tend to be specific in that only a subset of
components are “active” on each given input, which makes our bounds particularly
attractive for further analysis.

• Sparse coding. Another interesting case of our framework is obtained when the
specialized class $\mathcal{F}$ consists of sparse linear predictors. This case has been considered
in Maurer et al. (2013); Ruvolo and Eaton (2014) when the representation class
consists of linear functions. Different choices of sparse classes $\mathcal{F}$ could lead to interesting
learning methods.


• Representations in RKHS. As we already noted, the feature maps forming the class
$\mathcal{H}$ could be vector-valued functions in a reproducing kernel Hilbert space. Although
kernel methods are more difficult to apply to the large datasets required for MTRL and
need additional approximation steps, the representations learned using for example
Gaussian kernels would be very specific and suitable for our bounds.